Using the Advanced PDF Object

The Advanced PDF object in Advanced Process Automation enables you to capture text from a PDF document using an advanced OCR engine.

To capture text, images must have a resolution of at least 200 dpi, with a contrast of 50% brightness or more.

Each PDF file that is loaded cannot be larger than 2 GB.
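The input constraints above can be checked before a file is handed to the Advanced PDF object. The following Python sketch is not part of the product: `check_pdf_input` and its parameters are hypothetical names, the resolution and brightness values are assumed to have been measured elsewhere, and 2 GB is assumed to mean 2 x 1024^3 bytes.

```python
import os

MIN_DPI = 200                   # minimum image resolution stated above
MIN_BRIGHTNESS_PCT = 50         # minimum brightness stated above
MAX_FILE_BYTES = 2 * 1024 ** 3  # assumption: "2 GB" is read as 2 GiB


def check_pdf_input(path, dpi, brightness_pct):
    """Return a list of problems found; an empty list means the input looks OK.

    Hypothetical helper: dpi and brightness_pct must be supplied by the caller,
    because this sketch does not inspect the PDF contents itself.
    """
    problems = []
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_FILE_BYTES}")
    if dpi < MIN_DPI:
        problems.append(f"resolution {dpi} dpi is below {MIN_DPI} dpi")
    if brightness_pct < MIN_BRIGHTNESS_PCT:
        problems.append(
            f"brightness {brightness_pct}% is below {MIN_BRIGHTNESS_PCT}%"
        )
    return problems
```

Returning a list of problems rather than a single True or False lets a caller report every failed constraint at once.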

OCR functionality is included in the Advanced PDF object, which can be found in the Direct.Vsd.Library. Use the Advanced PDF object only for getting text and tables from a PDF.

The following languages are supported: Arabic, Armenian, Azeri, Bashkir, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Estonian, English, Farsi, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tatar, Thai, Turkish, Ukrainian, Vietnamese.
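A requested language list can be validated against the supported languages before it is passed on. The Python sketch below is illustrative only: `unsupported_languages` is a hypothetical helper, the set is copied from the list above, and the case-insensitive comparison is a design choice of this sketch, not documented product behavior.

```python
# Language names copied from the supported-languages list above.
SUPPORTED_LANGUAGES = {
    "Arabic", "Armenian", "Azeri", "Bashkir", "Bulgarian", "Catalan",
    "Croatian", "Czech", "Danish", "Dutch", "Estonian", "English", "Farsi",
    "Finnish", "French", "German", "Greek", "Hebrew", "Hungarian",
    "Indonesian", "Italian", "Japanese", "Korean", "Latin", "Latvian",
    "Lithuanian", "Norwegian", "Polish", "Portuguese", "Romanian", "Russian",
    "Slovak", "Slovenian", "Spanish", "Swedish", "Tatar", "Thai", "Turkish",
    "Ukrainian", "Vietnamese",
}

# Lower-cased lookup set so that "english" and "English" both match.
_SUPPORTED_LOWER = {name.lower() for name in SUPPORTED_LANGUAGES}


def unsupported_languages(requested):
    """Return the requested names that are not in the supported list.

    Hypothetical helper: the order of the input is preserved in the result.
    """
    return [
        name for name in requested
        if name.strip().lower() not in _SUPPORTED_LOWER
    ]
```

An empty result means every requested language appears in the supported list.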

Advanced PDF Object Functionality

You can review the available functions of the Advanced PDF object from the Direct.Vsd.Library in the Real-Time Designer.

To view Direct.Vsd.Library functionality:

1. In the Real-Time Designer, select the Project tab.
2. In the Reference section, browse to Library References > Direct.Vsd.Library.
3. In the Functionality tab, from the Type drop-down list, select Advanced PDF.

The following properties are available:

Property	Description
Active Page	The page number of the page in the PDF from which to read data. The default value is 1.
Block Count	The number of OCR blocks on the Active Page of the document.
File Name	The full path to the PDF file, for example: C:\Data\Sample1.pdf
Handwriting Identification Mode	The name of the expected handwriting style of the text, passed as an input to the OCR engine to improve text recognition. Relevant only when OCR Mode is set to Handwriting.
Languages	The expected language(s) of the text, passed as an input to the OCR engine to improve text recognition. The default is English.
OCR Mode	The OCR mode to be used by the OCR engine.
Pages Count	The number of pages in the PDF file.
Tables Count	The number of recognized tables on the active page. This property is populated only after table recognition has run, that is, after the first call to a function that gets text from a table by using a table index known in advance.

The following functions are available:

Function	Description
Crop Image	Retrieve a specified screen element rectangle from the active page of the PDF as an Advanced Picture Object.
Determine Brightness	Determines the brightness, as a percentage, of an area of the Advanced PDF object specified by coordinates (100 is white).
Get Block Words	Retrieve all words individually from a specified block on the active page of the PDF.
Get Checkmark State	Retrieve the state of the checkbox in a specified rectangle on the active page of the PDF. Returns a text value: Checked, NotChecked, Corrected (was checked but then corrected), or Not Recognized. If OCR is not installed, NotDetected is returned. See Using the OCR Get Checkmark State Function.
Get OCR Text Block	Retrieve the text from a specified block on the active page of the PDF.
Get Page Text	Retrieve the text from the active page of the PDF.
Get Suspicious Data	Retrieve all instances of suspicious data on the active page of the PDF. Optionally check suspicious words against a dictionary.
Get Table	Retrieve all text from a specified table on the active page of the PDF. Returns a list of rows, where each element of a row stores the text from the corresponding cell of the table.
Get Table Cells Rectangle	Retrieve the locations of all cells in a table. Returns a list of screen element rectangle objects, where each rectangle specifies the location of one cell.
Get Table Cells Text	Retrieve all text from a specified table on the active page of the PDF. Returns a list of text, where each element of the list stores the text from one cell of the table.
Get Word Locations	Retrieve a list of all locations of a specified word on the active page of the PDF. The locations are returned as Screen Element Rectangle objects. See Using the OCR Get Word Location Function.
Get Words	Retrieve all words individually from the active page of the PDF.
Load PDF Page Collection	Preload multiple pages, listed individually by page number, to speed up processing of large files.
Load PDF Page Range	Preload a range of pages, specified by the start page number and number of pages, to speed up processing of large files.
PDF Active Page To Picture	Populate an Advanced Picture Object variable with an image of the active page of the PDF.
Set Languages	Set the expected language(s) of the PDF text to improve character recognition.
Set OCR Mode	Sets the OCR mode used by the OCR engine. If the Default option does not work, you can try one of the following options:
Text (Speed): Use this option for a text document where speed is important. Faster than Text (Accuracy), but potentially less exact.
Text (Accuracy): Use this option for a text document where accuracy is important. Slower than Text (Speed), but potentially more exact.
Document (Speed): Use this option for a text document with objects such as tables where speed is important. Faster than Document (Accuracy), but potentially less exact.
Document (Accuracy): Use this option for a text document with objects such as tables where accuracy is important. Slower than Document (Speed), but potentially more exact.
Barcode (Speed): Use this option for a barcode where speed is important. Faster than Barcode (Accuracy), but potentially less exact.
Barcode (Accuracy): Use this option for a barcode where accuracy is important. Slower than Barcode (Speed), but potentially more exact.
Business Cards: Use this option for business cards.
Engineering Drawings: Use this option for drawings and charts.
One Block: Also called Field Level Recognition. Use this option to OCR a single block on the screen, together with controls or with crop functions to demarcate the block.
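
The function table describes two ways to preload pages: Load PDF Page Collection takes individual page numbers, and Load PDF Page Range takes a start page number and a number of pages. The Python sketch below shows one way to turn a list of wanted pages into (start page, page count) pairs, which can help in deciding which function fits. `to_page_ranges` is a hypothetical helper, not a product function, and it assumes that page numbers start at 1, in line with the Active Page default.

```python
def to_page_ranges(pages):
    """Group page numbers into (start_page, page_count) tuples.

    Hypothetical helper: duplicates are ignored and the input may be unsorted.
    Assumes 1-based page numbers.
    """
    ranges = []
    for page in sorted(set(pages)):
        if page < 1:
            raise ValueError(f"page numbers start at 1, got {page}")
        if ranges and page == ranges[-1][0] + ranges[-1][1]:
            # The page directly follows the last range, so extend that range.
            start, count = ranges[-1]
            ranges[-1] = (start, count + 1)
        else:
            # A gap was found, so start a new range at this page.
            ranges.append((page, 1))
    return ranges
```

If the result is a single pair, the wanted pages are contiguous and map directly onto the start page and number of pages that Load PDF Page Range expects. Several scattered pairs suggest listing the pages individually with Load PDF Page Collection instead.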